Abstract

Tannier et al. introduced a generalization of breakpoint distance for multichromosomal genomes. They showed that 
the median problem under the breakpoint distance is solvable in polynomial time in the multichromosomal circular 
and mixed models. This is intriguing, since in all other rearrangement models (DCJ, reversal, unichromosomal or 
multilinear breakpoint models), the problem is NP-hard. The complexity of the small or even the large phylogeny 
problem under the breakpoint distance remained an open problem. 

We improve the algorithm for the median problem and show that it is equivalent to the problem of finding a maximum
cardinality non-bipartite matching (under linear reduction). On the other hand, we prove that the more general
small phylogeny problem is NP-hard. Surprisingly, we show that it is already NP-hard (or even APX-hard) for 
4 species (a quartet phylogeny). In other words, while finding an ancestor for 3 species is easy, already finding two 
ancestors for 4 species is hard. 

We also show that, in the unichromosomal and the multilinear breakpoint model, the halving problem is NP-hard, 
thus refuting the conjecture of Tannier et al. Interestingly, this is the first problem which is harder in the breakpoint 
model than in the DCJ or reversal models.

1. Introduction

While point mutations change the genome sequence of species throughout evolution, there are also large-scale
rearrangement mutations, such as inversions or translocations, which affect the order of genes in a genome. The gene
order data can be used for inferring phylogenetic relationships and for reconstructing phylogenies [1]. A related
problem is the reconstruction of ancestral gene orders, which is key to understanding the underlying evolutionary
processes.

The simplest model for studying gene orders is the breakpoint model introduced by Sankoff and Blanchette [2].
When two genes (or conserved segments or markers) are adjacent in one genome, but not in the other, we call this
position a breakpoint. We can then define the breakpoint distance simply by counting the number of breakpoints.

Sankoff and Blanchette [2] tried to reconstruct the ancestral gene orders, given a phylogenetic tree and gene orders
of the extant species, based on the parsimony criterion, i.e., by minimizing the sum of distances along the branches
of the tree. This is known as the small phylogeny problem. Unfortunately, the problem is NP-hard already when we
have 3 species - an important special case known as the median problem. In fact, the median problem turns out to be
NP-hard for almost all rearrangement distances (breakpoint [3-5], reversal [6], and DCJ [5]).

One notable exception is the general breakpoint model. Tannier et al. [5] observed that if we drop the condition
that genomes are unichromosomal and that all chromosomes are linear, we get a simple model where the median 
problem is solvable in polynomial time. Even though this model is not very biologically plausible and more realistic 
models exist, the breakpoint model may still be useful for upper and lower bounds and solutions in this model may 
serve as good starting points for the more elaborate and complicated models.

In this paper, we complete the work started by Tannier et al. [5] on the breakpoint model. We study several
rearrangement problems in different variants of the breakpoint model and settle their computational complexity. 

1.1. Previous results and our contribution 

There are several variants of the breakpoint model depending on which karyotypes we allow. In the unichromosomal
(linear or circular) model, the genome may only consist of one chromosome. In the multilinear model, the
genome may consist of multiple linear chromosomes and finally, the mixed model allows for any number of linear
and circular chromosomes (even though this is not biologically plausible).

For the unichromosomal model, Pe'er and Shamir [3] and Bryant [4] showed that the median problem is NP-hard.
This result was extended to the multilinear model by Tannier et al. [5], and Zheng et al. [7] showed the NP-hardness
of a related problem called guided halving (see Preliminaries).

Curiously, the ordinary halving problem was not studied before in the breakpoint model and Tannier et al.
[5] also leave it open. Moreover, they conjecture that the problem is polynomially solvable - this might perhaps be
attributed to the fact that the halving problem is polynomially solvable in far more complicated models such as
reversal/translocation (RT) [8] or double cut and join (DCJ) [9-12]. Nevertheless, we refute this conjecture (unless
P = NP) by proving that the halving problem is NP-complete in the unichromosomal and multilinear models.

Our main contribution is, however, our work in the general (mixed) model. Tannier et al. [5] introduced this model
and showed that the median, halving, and guided halving problems are all solvable in polynomial time.

Two open questions remained in the work of Tannier et al. [5]. These are also articulated in the monograph by
Fertin et al. [13]:

1. The best time complexity for the median and guided halving problems under the breakpoint distance
on multichromosomal genomes (with circular chromosomes allowed) is $O(n^3)$, using a reduction to the
maximum weight perfect matching problem. It is an open problem to devise an ad-hoc algorithm with 
better complexity. 

2. The small parsimony problem and large parsimony problem under the breakpoint distance is open 
regarding multichromosomal signed genomes when linear and circular chromosomes are allowed. 

We resolve the first question in a positive way by showing a more efficient algorithm running in $O(n\sqrt{n})$ time. This is
by reduction to the maximum cardinality matching problem. Moreover, we show that maximum cardinality matching
can be reduced back to the breakpoint median (by a linear reduction) and so the two problems have essentially the
same complexity. The same technique also improves the algorithms for halving and guided halving.

The second question is resolved in a negative way. Surely, one could expect that the large parsimony problem 
is NP-hard for this model, since it is NP-hard even for the Hamming distance on binary strings [14]. However,
surprisingly, for the breakpoint distance (unlike the Hamming distance), the small phylogeny problem is NP-hard, and it is
NP-hard even for 4 species, i.e., a quartet phylogeny. In other words, while the small phylogeny problem is easy for
3 species, it is hard already for 4 species. 

The previous work and our new results are summarized in Table 1.



Table 1: Our new results in context of the previously known results. NP-C stands for NP-complete.

1.2. Road map

In the next section, we define the different variants of the breakpoint model and state the rearrangement problems. 
In Section 3, we refute the conjecture of Tannier et al. [5] and prove that the halving problem is NP-hard. In the
following two sections, we study the general breakpoint model. In Section 4, we look at the median problem: we
improve upon the algorithm of Tannier et al. [5] and show that it is equivalent to the maximum matching problem.
The hardness of the small phylogeny problem is studied in Section 5 and we conclude in Section 6. 



2. Preliminaries 

2.1. Genome models and the breakpoint distance 

We assume that all the genomes have the same gene content and we denote this set of genes by $Q$. We also
assume that each gene $g \in Q$ is an oriented segment of DNA having two ends - a head and a tail. These two ends are
called extremities and are denoted $g_h$ and $g_t$, respectively. Let us first describe the circular models, which are the most
used throughout the paper, since they are easier to work with. We then extend our definitions to account for linear
chromosomes.

We represent genome $\pi$ by a set of edges: An edge between extremities $x$ and $y$, called adjacency, indicates that
$x$ and $y$ are adjacent in the genome. Note that in circular genomes, every extremity is adjacent to exactly one other
extremity, so we can identify genomes with perfect matchings over the set of extremities.

Let us define an auxiliary base matching $B = \{g_h g_t : g \in Q\}$ where each edge connects the two ends of some
gene. Then all vertices have degree 2 in the union $\pi \cup B$ and $\pi \cup B$ decomposes into a set of cycles, which naturally
correspond to the circular chromosomes of our genome (see Fig. 1).

In the general (multichromosomal circular) model, genomes can have multiple circular chromosomes and any
perfect matching $\pi$ corresponds to a genome. In the unichromosomal circular model, we require that the genome only
consists of a single chromosome, so $\pi \cup B$ is a Hamiltonian cycle as in Fig. 1. Such a matching $\pi$ is sometimes called
a Hamiltonian matching.
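As an illustration of the decomposition of $\pi \cup B$ into cycles, the following minimal sketch walks the alternating cycles and lists the genes of each circular chromosome. The representation is an assumption made for this example only: an extremity is a pair `(gene, 'h')` or `(gene, 't')`, an adjacency is a `frozenset` of two extremities, and the function name `chromosomes` is hypothetical.

```python
def chromosomes(genome):
    """Decompose pi ∪ B into cycles for a circular genome.

    genome: a perfect matching given as a set of frozenset({x, y}) adjacencies,
    where each extremity is (gene, 'h') or (gene, 't') (assumed encoding).
    Returns a list of gene lists, one per circular chromosome.
    """
    partner = {}
    for adjacency in genome:
        x, y = tuple(adjacency)
        partner[x] = y
        partner[y] = x

    seen, result = set(), []
    for start in sorted(partner):
        if start[0] in seen:
            continue
        cycle, x = [], start
        while x[0] not in seen:
            seen.add(x[0])
            cycle.append(x[0])
            # Cross the gene along its base-matching edge ...
            other_end = (x[0], 't' if x[1] == 'h' else 'h')
            # ... then follow the adjacency edge to the next gene.
            x = partner[other_end]
        result.append(cycle)
    return result


def is_hamiltonian(genome):
    """True if pi ∪ B is a single cycle (unichromosomal circular genome)."""
    return len(chromosomes(genome)) == 1
```

A matching with adjacencies $1_h 2_t$, $2_h 3_t$, $3_h 1_t$ yields a single cycle through genes 1, 2, 3, whereas replacing them by $1_h 1_t$, $2_h 3_t$, $3_h 2_t$ yields two chromosomes.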




(a) The order of genes in a genome. Each arrow corresponds to a single gene with known orientation.

(b) Representation of the genome on the left by a perfect matching. The green edges are the adjacencies of $\pi$; the gray edges form the base matching $B$. The Hamiltonian cycle $\pi \cup B$ corresponds to the single chromosome.

Figure 1: Example of a circular genome $\pi$ and its representation by a perfect matching.



Let $\pi_1$ and $\pi_2$ be two genomes, i.e., two perfect matchings. Then the breakpoint distance between $\pi_1$ and $\pi_2$ is defined
as

    $d(\pi_1, \pi_2) = n - \mathrm{sim}(\pi_1, \pi_2)$,

where $n$ is the number of genes and $\mathrm{sim}(\pi_1, \pi_2)$ is the number of common adjacencies. The breakpoint distance
satisfies all the properties of a metric and is used in the literature; however, we find it easier to work directly with the
similarity measure $\mathrm{sim}(\pi_1, \pi_2)$.
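The definition can be checked on a small example. The sketch below assumes, for illustration only, that a circular genome is stored as a set of `frozenset` adjacencies over extremities `(gene, 'h')` and `(gene, 't')`; the helper names are hypothetical.

```python
def adj(x, y):
    """An adjacency is an unordered pair of extremities."""
    return frozenset((x, y))


def sim(genome1, genome2):
    """Number of adjacencies the two perfect matchings have in common."""
    return len(genome1 & genome2)


def breakpoint_distance(genome1, genome2):
    """d = n - sim, where n is the number of genes.

    A perfect matching over 2n extremities has exactly n adjacencies,
    so n can be read off as the size of the matching.
    """
    n = len(genome1)
    return n - sim(genome1, genome2)


# Genome (1 2 3) as one circular chromosome.
g1 = {adj((1, 'h'), (2, 't')), adj((2, 'h'), (3, 't')), adj((3, 'h'), (1, 't'))}
# Genome with chromosomes (1 2) and (3).
g2 = {adj((1, 'h'), (2, 't')), adj((2, 'h'), (1, 't')), adj((3, 'h'), (3, 't'))}

print(sim(g1, g2), breakpoint_distance(g1, g2))  # prints "1 2"
```

Only the adjacency $1_h 2_t$ is shared, so the similarity is 1 and the distance is $3 - 1 = 2$.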
To represent linear chromosomes, we add a vertex $T_x$ for each extremity $x$. These vertices are called telomeres
and a telomeric adjacency $xT_x$ indicates that $x$ is an end of a linear chromosome (see Fig. 2).

Genomes will again correspond to matchings with a condition that $T_x$ may only be adjacent to $x$. If $\pi$ is such a
matching, $\pi \cup B$ consists of cycles and paths ending with telomeres, which correspond to circular and linear
chromosomes, respectively. In the mixed model, any such matching $\pi$ represents a genome; in the multilinear model, we
require that every chromosome is linear; and in the linear model, we only allow a single linear chromosome.

We can write the breakpoint distance again in the form $d(\pi_1, \pi_2) = n - \mathrm{sim}(\pi_1, \pi_2)$, where this time, $\mathrm{sim}(\pi_1, \pi_2)$ is
the number of common adjacencies plus half the number of common telomeric adjacencies (as introduced by Tannier
et al. [5]).



(a) A genome with 2 linear and 1 circular chromosome. Such genomes are not found in nature; however, the model is motivated by tractability of the rearrangement problems such as median.

(b) The same genome represented as a set of adjacencies (green matching). Gray edges form the base matching $B$. Components of $\pi \cup B$ are paths and cycles, corresponding to linear and circular chromosomes, respectively.

Figure 2: Example of a mixed genome $\pi$ and its representation.



2.2. Duplicated genomes 

We will also work with duplicated genomes that underwent a whole genome duplication and have exactly two
copies of each gene. For each gene $g$, let us label the first copy $g^1$ and the second copy $g^2$. Then we can represent a
duplicated genome by an ordinary genome $\delta$ over the gene set $\{g^1, g^2 : g \in Q\}$. However, note that the labels were
introduced arbitrarily and we consider two genomes that differ only in the copy labels of some genes as equivalent. A
duplicated genome actually corresponds to the equivalence class $[\delta]$.

We can define the breakpoint distance (similarity) between two duplicated genomes $[\gamma]$ and $[\delta]$ as the minimum
distance (maximum similarity) between ordinary genomes $\gamma' \in [\gamma]$ and $\delta' \in [\delta]$. In fact, we can fix one $\gamma' \in [\gamma]$ and
take the minimum (maximum) over $\delta' \in [\delta]$.

Let us write $\delta = \pi \oplus \pi$ for a perfectly duplicated genome - the result of a whole genome duplication. For each
linear chromosome in $\pi$, $\delta$ contains two copies of the chromosome and for each circular chromosome in $\pi$, $\delta$ contains
either two copies of the chromosome or one chromosome consisting of the two copies consecutively. The distance
between an ordinary genome $\pi$ and a duplicated genome $[\delta]$, also called double distance and denoted $dd(\pi, \delta)$, is then
the distance between $\pi \oplus \pi$ and $[\delta]$.

We say that $\pi$ and $[\delta]$ have adjacency $xy$ in common, if $x, y$ are adjacent in $\pi$ and $x^i, y^j$ are adjacent in $\delta$ for some $i$
and $j$. We say that they have the adjacency $xy$ twice in common, if either $x^1y^1$ and $x^2y^2$, or $x^1y^2$ and $x^2y^1$ are adjacent
in $\delta$. Tannier et al. [5] showed that the double distance $dd(\pi, \delta)$ can be computed simply as $dd(\pi, \delta) = 2n - \mathrm{sim}(\pi, \delta)$,
where $\mathrm{sim}(\pi, \delta)$ is the number of adjacencies in common plus half the number of telomeric adjacencies in common
(adjacencies twice in common are counted as 2).
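The formula $dd(\pi, \delta) = 2n - \mathrm{sim}(\pi, \delta)$ can be illustrated for circular genomes (no telomeres) with a short sketch. The encoding is an assumption for this example: extremities of the duplicated genome are triples `(gene, end, copy)`, extremities of the ordinary genome are pairs `(gene, end)`, and the function names are hypothetical. Dropping the copy label projects each adjacency of $\delta$ onto a candidate adjacency of $\pi$, and an adjacency that is twice in common is then simply counted twice.

```python
from collections import Counter


def double_sim(pi, delta):
    """sim(pi, delta) for circular genomes: adjacencies in common,
    with adjacencies twice in common counted as 2."""
    projected = Counter(
        frozenset((x[:2], y[:2])) for x, y in (tuple(a) for a in delta)
    )
    return sum(projected[a] for a in pi)


def double_distance(pi, delta):
    """dd(pi, delta) = 2n - sim(pi, delta), where n is the number of genes."""
    return 2 * len(pi) - double_sim(pi, delta)


fs = frozenset
pi = {fs({(1, 'h'), (2, 't')}), fs({(2, 'h'), (1, 't')})}
# pi ⊕ pi as one chromosome made of the two copies consecutively.
perfect = {
    fs({(1, 'h', 1), (2, 't', 1)}), fs({(2, 'h', 1), (1, 't', 2)}),
    fs({(1, 'h', 2), (2, 't', 2)}), fs({(2, 'h', 2), (1, 't', 1)}),
}
# A rearranged duplicated genome sharing each adjacency of pi only once.
rearranged = {
    fs({(1, 'h', 1), (2, 't', 1)}), fs({(2, 'h', 1), (1, 't', 1)}),
    fs({(1, 'h', 2), (1, 't', 2)}), fs({(2, 'h', 2), (2, 't', 2)}),
}
print(double_distance(pi, perfect), double_distance(pi, rearranged))  # prints "0 2"
```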

2.3. Rearrangement problems 

Once we have a genome model and a distance measure, we can define the problems of interest. In general, the
focus of our study is on problems related to reconstruction of ancestral genomes under the parsimony principle.

Assume that we have two genomes $\pi_1$ and $\pi_2$ and we would like to reconstruct their common ancestor $\alpha$.
Using a third, outgroup genome $\pi_3$, we can formulate the task as the Median problem: Given $\pi_1$, $\pi_2$, and $\pi_3$,
find genome $\alpha$ (called median) that minimizes the total distance from $\pi_1$, $\pi_2$, and $\pi_3$. In the Breakpoint-Median
problem, we are minimizing the breakpoint distance, which is the same as maximizing the median score $S(\alpha) =
\mathrm{sim}(\alpha, \pi_1) + \mathrm{sim}(\alpha, \pi_2) + \mathrm{sim}(\alpha, \pi_3)$. Note that the genome model imposes further constraints on the solution - the
number and type of chromosomes.

We can generalize the median problem to the median of $k$ genomes problem, where given genomes $\pi_1, \ldots, \pi_k$, we
should find genome $\alpha$ that maximizes the score $S(\alpha) = \sum_{i=1}^{k} \mathrm{sim}(\alpha, \pi_i)$. However, an even more important generalization is
the Small-Phylogeny problem, where we are given a phylogenetic tree and gene orders of the extant species (leaves
of the tree). The task is to reconstruct all the ancestral genomes, i.e., to find gene orders for each internal vertex,
while minimizing the sum of (breakpoint) distances along the edges of the phylogenetic tree. (This is the same as
maximizing the sum of similarities along the edges.) The Median problem is a special case of the Small-Phylogeny
problem with just 3 species (since there is only one unrooted tree with 3 leaves). On the other hand, median solvers
are widely used in practice in the Steinerization approach to reconstruct the ancestors in Small-Phylogeny: Starting
with some initial ancestral genomes, we repeatedly replace genomes by medians of the neighbouring genomes in
the phylogeny, until we converge to some local optimum. Therefore, having a model where the Median problem is
efficiently solvable might be of practical significance.

Another classical problem in genome rearrangements is the Halving problem. Imagine a genome $\pi$ that underwent
a whole genome duplication. The perfectly duplicated genome $\delta = \pi \oplus \pi$ was then rearranged through the evolution
to its present-day form $\gamma$. In the Halving problem, we would like to reconstruct the pre-duplication ancestor $\pi$ given
the present-day genome $\gamma$. More precisely, we would like to find an ordinary genome $\alpha$ that minimizes the double
distance from $\gamma$.

In the Halving problem, there are usually many equivalent solutions. For better results, we can use an ordinary
outgroup genome $\rho$ (such that the speciation happened before the whole genome duplication) and search for genome
$\alpha$ that minimizes $dd(\alpha, \gamma) + d(\alpha, \rho)$. This is called the Guided-Genome-Halving problem.

3. The halving problem 

Bryant [4] showed that the median problem is NP-hard in the circular breakpoint model by reduction from the
Directed-Hamiltonian-Cycle problem. The halving problem was not studied previously in the breakpoint model but
we show that it suffers the same "Hamiltonian" curse as the median problem - in order to find the ancestor, we would
in fact have to find a Hamiltonian cycle. Our proof is even simpler than that of Bryant [4].

As the halving problem is polynomially solvable in more realistic models such as the RT model [8] or the DCJ
model [9-12], the halving problem under the breakpoint distance will remain a mere curiosity: It is the first problem
which is easier in the DCJ or even in the RT model than in the breakpoint model. Furthermore, it is the only known
case where halving is NP-hard, while the double distance is computable in polynomial time (in the DCJ model, the
opposite is true - halving is easy, while the double distance is NP-hard [5]).

Theorem 1. The halving problem is NP-hard in the circular, linear, and multilinear breakpoint models.

Proof. The proof is by reduction from the Directed-Hamiltonian-Cycle problem. Plesník [15] proved that this problem
is still NP-hard for graphs with maximum degree 2 and the construction implies the problem is also NP-hard if
all the vertices have in-degree and out-degree 2. Note that such graphs have an Eulerian cycle.

Let $G = (V, E)$ be such a directed graph; the corresponding doubled genome $\delta$ will have two copies of one gene
for every vertex in $G$. We create a new graph $G' = (V', E')$, where $V' = \{x_h^1, x_t^1, x_h^2, x_t^2 : x \in V\}$ and the edges in
$E'$ are defined as follows: traverse the Eulerian walk and for each edge $xy \in E$, include an edge $x_h^i y_t^j$ in $E'$, where $i$
and $j$ depend on whether we are visiting the vertices for the first time or for the second time. Note that all edges go
from head to tail, $E'$ is a perfect matching, and $G'$ represents the doubled genome $\delta$ consisting of a single circular
chromosome.
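The construction of $E'$ can be sketched as follows. This is only an illustration under stated assumptions: the Eulerian circuit is given as a list of vertices in which every vertex occurs exactly twice, the copy index of a vertex is taken to be its visit count at that position of the circuit (the text only says that $i$ and $j$ depend on the first or second visit), extremities are encoded as triples `(vertex, end, copy)`, and the function name is hypothetical.

```python
def doubled_genome_from_euler(circuit):
    """Build the doubled genome from an Eulerian circuit.

    circuit: list of vertices v_0, ..., v_{m-1}; the edges of the digraph are
    v_k -> v_{k+1} (indices modulo m) and every vertex appears exactly twice.
    Returns a set of adjacencies frozenset({(x, 'h', i), (y, 't', j)}).
    """
    visits = {}
    copy_at = []
    for v in circuit:
        visits[v] = visits.get(v, 0) + 1
        copy_at.append(visits[v])  # 1 on the first visit, 2 on the second

    m = len(circuit)
    delta = set()
    for k in range(m):
        nxt = (k + 1) % m
        x, i = circuit[k], copy_at[k]
        y, j = circuit[nxt], copy_at[nxt]
        # Edge x -> y becomes an adjacency from the head of x^i to the tail of y^j.
        delta.add(frozenset(((x, 'h', i), (y, 't', j))))
    return delta


# Complete digraph on {a, b, c}: every vertex has in-degree and out-degree 2.
# One Eulerian circuit: a -> b -> c -> a -> c -> b -> a.
delta = doubled_genome_from_euler(['a', 'b', 'c', 'a', 'c', 'b'])
```

In this example the six edges of the circuit give six head-to-tail adjacencies that cover each of the twelve extremities exactly once.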

Let $\alpha$ be a circular genome, a solution to the halving problem. Note that $\delta$ has no double adjacencies, so $\alpha$ can
have at most $n$ adjacencies in common (none are twice in common). This maximum can be attained if and only if all
the adjacencies in $\alpha$ are of the form $x_h y_t$ (from head to tail) and for each such adjacency, $x_h^i y_t^j$ is an adjacency in $\delta$ for
some $i, j$. This is if and only if $xy \in E$. So by contracting the base matching (each head and tail of a gene into a single
vertex) and orienting the edges, we get a directed Hamiltonian cycle in $G$.

For the linear and multilinear models, remove one edge $xy$ from $G$ and consider the problem of deciding whether
$G$ contains a directed Hamiltonian path. This problem is still NP-hard and can be reduced to the halving problem
in the linear models: $G$ now has an Eulerian path starting in $y$ and ending in $x$. We replace the last adjacency $x_h^i y_t^j$
in $\delta$ (corresponding to the removed edge) by two telomeric adjacencies $x_h^i T_{x_h^i}$ and $y_t^j T_{y_t^j}$ to get a linear genome. If $\alpha$
is a linear or multilinear solution to the halving problem, it can reach the maximum similarity if and only if all its
adjacencies (including the telomeric adjacencies) are in common with $\delta$ and this is if and only if contraction of $\alpha$ is a
directed Hamiltonian path in $G$. □



4. Median and halving problems in the general model 

From now on, we will study the general breakpoint model, i.e., the multichromosomal circular model where 
genomes are perfect matchings. We will also note how to extend the results to the mixed model and use the developed 
techniques for the halving and guided halving problem. 

4.1. Breakpoint median 

Tannier et al. [5] noticed that finding a breakpoint median can be reduced to finding a maximum weight perfect
matching. This can be done in $O(n^3)$ time by the algorithm of Gabow [16] and Lawler [17]. An open problem from
Tannier et al. [5] and Fertin et al. [13] asks whether this can be improved. We answer this question affirmatively by
showing an $O(n\sqrt{n})$ algorithm.

The solution by Tannier et al. [5] (if we rephrase it using the similarity measure instead of the breakpoint distance)
was to create a complete weighted graph $G$ where vertices are extremities and weight $w(xy)$ of edge $xy$ is the number
of genomes which contain the adjacency $xy$. Any perfect matching $\alpha$ corresponds to some genome and the weight of
the matching is equal to its median score $S(\alpha)$.
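The correspondence between matching weight and median score can be checked directly. In this minimal sketch, the genome encoding (sets of `frozenset` adjacencies over `(gene, end)` extremities) and the function names are assumptions made for illustration; zero-weight edges of the complete graph are left implicit because a `Counter` returns 0 for missing keys.

```python
from collections import Counter


def weight_graph(genomes):
    """w(xy) = number of input genomes that contain the adjacency xy."""
    w = Counter()
    for genome in genomes:
        w.update(genome)
    return w


def median_score(alpha, genomes):
    """S(alpha) = sum of sim(alpha, pi_i) over the input genomes."""
    return sum(len(alpha & genome) for genome in genomes)


def matching_weight(alpha, w):
    """Total weight of the perfect matching alpha in the weighted graph."""
    return sum(w[a] for a in alpha)


fs = frozenset
A = {fs({(1, 'h'), (1, 't')}), fs({(2, 'h'), (2, 't')})}
B = {fs({(1, 'h'), (2, 't')}), fs({(2, 'h'), (1, 't')})}
C = {fs({(1, 'h'), (2, 'h')}), fs({(1, 't'), (2, 't')})}

genomes = [A, A, B]
w = weight_graph(genomes)
for alpha in (A, B, C):
    print(matching_weight(alpha, w), median_score(alpha, genomes))
# prints "4 4", then "2 2", then "0 0"
```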

Notice that instead of finding a maximum weight perfect matching, we can remove all the zero-weight edges
from $G$ and find an ordinary (not necessarily perfect) matching. We can then complete the genome by joining the
free vertices arbitrarily. Since the number of edges in $G$ is now linear, maximum weight matching can be found in
$O(n^2 \log n)$ time by the algorithm of Gabow [18] or even in $\tilde{O}(n\sqrt{n})$ time by the state-of-the-art algorithm of Gabow and Tarjan
[19], using the fact that the weights are small integers. More generally and more precisely:

Theorem 2. The Breakpoint-Median problem for $k$ genomes can be solved in $O(kn\sqrt{n} \cdot \log(kn) \cdot \sqrt{\alpha(kn, n) \log n})$ time
in the general model. (Here, $\alpha(m, n)$ is the inverse Ackermann function.)

We further improve the algorithm for the most important special case, $k = 3$: Notice that when $xy$ is an edge
with weight 3, there is no other edge incident to $x$ or $y$. Therefore, $xy$ must belong to the maximum weight matching.
Moreover, if $xy$ has weight 2, there is a maximum weight matching which contains $xy$. For suppose that $xu$ and $yv$
were matched in $\alpha$ instead. Then $w(xu)$ and $w(yv)$ are at most 1 and by exchanging these edges for $xy$ and $uv$ with
weights $w(xy) = 2$ and $w(uv) \ge 0$, we get a matching with the same or even higher weight.

Thus, we can include all edges of weight 2 and 3 in the matching and remove the matched vertices together with
their incident edges. The remaining graph has only unit edge weights, so a maximum cardinality matching algorithm
can be used. The algorithm by Micali and Vazirani [20] runs in $O(m\sqrt{n})$ time. Thus, we have

Theorem 3. The Breakpoint-Median problem for 3 genomes can be solved in $O(n\sqrt{n})$ time (in the general model).
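The steps behind Theorem 3 can be sketched on small inputs: include every edge of weight 2 or 3, drop the matched vertices, find a maximum cardinality matching among the remaining unit-weight edges, and join the leftover vertices arbitrarily. The sketch is not the $O(n\sqrt{n})$ algorithm: as a loudly labeled stand-in for Micali and Vazirani's algorithm it uses an exhaustive search that is exponential and only suitable for toy instances. The genome encoding and all function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def max_cardinality_matching_bruteforce(edges):
    """Exhaustive stand-in for a real matching algorithm (toy inputs only)."""
    edges = list(edges)
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            used = [v for e in subset for v in e]
            if len(used) == len(set(used)):  # no vertex is used twice
                return set(subset)
    return set()


def median_of_three(genomes):
    """Breakpoint median of 3 circular genomes (sets of frozenset adjacencies)."""
    w = Counter()
    for genome in genomes:
        w.update(genome)
    # Edges of weight 2 or 3 never share a vertex and can always be included.
    forced = {e for e, count in w.items() if count >= 2}
    matched = {v for e in forced for v in e}
    # The remaining graph has unit weights only.
    rest = [e for e, count in w.items() if count == 1 and not (e & matched)]
    alpha = forced | max_cardinality_matching_bruteforce(rest)
    # Complete the genome by joining the free extremities arbitrarily.
    covered = {v for e in alpha for v in e}
    free = sorted(v for e in genomes[0] for v in e if v not in covered)
    alpha |= {frozenset(free[i:i + 2]) for i in range(0, len(free), 2)}
    return alpha


fs = frozenset
g1 = {fs({(1, 'h'), (2, 't')}), fs({(2, 'h'), (3, 't')}), fs({(3, 'h'), (1, 't')})}
g2 = {fs({(1, 'h'), (2, 't')}), fs({(2, 'h'), (1, 't')}), fs({(3, 'h'), (3, 't')})}
g3 = {fs({(1, 'h'), (3, 't')}), fs({(3, 'h'), (2, 't')}), fs({(2, 'h'), (1, 't')})}
median = median_of_three([g1, g2, g3])
score = sum(len(median & g) for g in (g1, g2, g3))
```

Here the two weight-2 adjacencies $1_h 2_t$ and $2_h 1_t$ are forced, the only remaining unit-weight edge is $3_h 3_t$, and the resulting median coincides with the second genome, with score $1 + 3 + 1 = 5$.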

Now, one might still wonder whether there is an even better algorithm for the median problem. We show that
improving upon our result is very hard, since it would immediately imply a better algorithm for the matching problem,
i.e., beating the result of Micali and Vazirani [20] (at least on cubic graphs), which has been an open problem for more than
30 years.

Biedl [21] showed that the maximum matching problem is reducible to the maximum matching problem in cubic
graphs by a linear reduction. This means that we can transform any given graph $G$ with $m$ edges to a cubic graph $G'$
with $O(m)$ edges such that maximum matching in $G$ can be recovered from one in $G'$ in $O(m)$ time. Thus, any $O(f(m))$
algorithm for maximum matching in cubic graphs implies an $O(f(m) + m)$ algorithm for arbitrary graphs.

(a) Edge $xy$ (top) should be colored green (this is the only missing color at $x$) and red at the same time (this is the missing color at $y$). We resolve this conflict by subdividing edge $xy$ by two new vertices (bottom); we color $xu$ green, $vy$ red and $uv$ blue.

(b) In the second phase, we duplicate graph $G$ and connect the corresponding vertices with degree 2 as shown in the figure.

Figure 3: Linear reduction of maximum matching in cubic graphs to breakpoint median problem.



We say that a reduction is strongly linear, if it is linear and both the number of vertices and the number of edges
increase at most linearly. Such a reduction preserves the running time $O(f(m, n))$ depending on both the number of
vertices and the number of edges.

We prove that the Breakpoint-Median problem is equivalent to Matching under linear reduction and to Cubic-Matching
under strongly linear reduction. If we write $\le_{\ell}$ for linear and $\le_{s\ell}$ for strongly linear reduction, we have

    Matching $\le_{\ell}$ Cubic-Matching $\le_{s\ell}$ Breakpoint-Median $\le_{s\ell}$ Matching.

The first reduction is by Biedl [21] and the last one was shown in Theorem 3 (in fact, a reduction to Subcubic-Matching,
where the degrees are $\le 3$, was shown - this is equivalent to Cubic-Matching under the strongly linear reduction [21]).
We now prove the middle reduction (Theorem 4).

Let $G$ be a cubic graph, an instance of the Cubic-Matching problem. The difference between the Cubic-Matching
and Breakpoint-Median problem is that in Breakpoint-Median, the input multigraph consists of three perfect matchings,
i.e., is 3-colorable (its edges can be properly colored with three colors). However, not all cubic graphs are 3-colorable (take for example Petersen's graph).



The solution is to color edges arbitrarily and resolve conflicts as shown in Figure 3(a). We can for example color
the ends of edges at each vertex randomly by three different colors. When both ends of an edge are assigned the same
color, we color the edge appropriately. When the ends have different colors, we subdivide the edge into three parts and
use the third color for the middle edge (see Figure 3(a)). Note that each such subdivision increases the size of a maximum
matching by exactly one: If $xy$ is matched in the original graph, $xu$ and $vy$ can be matched
in the modified graph. If $xy$ is not matched, we can still match $uv$.
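The conflict-resolution step can be sketched as follows. This is an illustration under assumptions: colors are the integers 0, 1, 2; instead of a random assignment, the ends of edges at each vertex receive colors in the order in which the edges are listed (any assignment of three distinct colors per vertex would do); new subdivision vertices are named `('u', k)` and `('v', k)`; and the function name is hypothetical.

```python
def subdivide_conflicts(edges):
    """Turn a cubic graph into a properly 3-edge-colored graph.

    edges: list of (x, y) pairs of a cubic graph.
    Returns a list of (a, b, color) triples with color in {0, 1, 2}.
    """
    next_color = {}
    colored = []
    for k, (x, y) in enumerate(edges):
        # Color of the edge end at x and at y: 0, 1, 2 in order of appearance.
        cx = next_color.get(x, 0)
        next_color[x] = cx + 1
        cy = next_color.get(y, 0)
        next_color[y] = cy + 1
        if cx == cy:
            colored.append((x, y, cx))
        else:
            # Conflict: subdivide x - u - v - y and use the third color in the middle.
            u, v = ('u', k), ('v', k)
            third = 3 - cx - cy
            colored += [(x, u, cx), (u, v, third), (v, y, cy)]
    return colored


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
colored = subdivide_conflicts(k4)
```

On the complete graph $K_4$ with this edge order, three edges are conflict-free and three are subdivided, giving 12 colored edges in which every original vertex sees all three colors and every new vertex sees two distinct colors.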

Now, the modified graph is 3-colorable but not cubic. We remedy this by duplicating the whole graph and
connecting the corresponding vertices of low degree as shown in Figure 3(b). As noted above, we may suppose that the
auxiliary double edges $a_u a'_u$ and $a_v a'_v$ are matched, so $u a_u$, $u' a'_u$, $v a_v$, and $v' a'_v$ are not matched and given the solution
for the Breakpoint-Median problem, we can recover the maximum matching of $G$ in $O(n)$ time. The reduction is
obviously linear, so we have

Theorem 4. The Breakpoint-Median problem (in the general model) has the same complexity as finding maximum 
cardinality matching in cubic graphs. 

4.2. Median in the mixed model 

In the mixed model, the weight of a telomeric adjacency $xT_x$ is equal to half the number of genomes that contain $xT_x$.
If we multiply all weights by 2, we can use the algorithm by Gabow and Tarjan [19] for integer weights, so the result
of Theorem 2 remains valid also in the mixed model.

For the median of 3 genomes, an $O(n\sqrt{n})$ algorithm exists: We observed that we can include all the double and
triple adjacencies in the matching. This is also true for the double and triple telomeric adjacencies (edges of weight
1 and $1\tfrac{1}{2}$): If $w(xT_x) = 1\tfrac{1}{2}$, $xT_x$ is a triple adjacency and no other edge is incident to either $x$ or $T_x$ in $G$. If
$w(xT_x) = 1$ but the median $\alpha$ contains adjacency $xy$ instead, then $w(xy) \le 1$ and since $T_x$ can only be incident to $x$, it
must be unmatched (or matched by a zero-weight edge) and so we can replace $xy$ by $xT_x$ in $\alpha$.

The remaining graph consists of edges with unit weight and weight $\tfrac{1}{2}$. Note however that all the $\tfrac{1}{2}$-weight edges
are of the form $xT_x$ and there is no other edge incident to $T_x$. We use the doubling trick again: we take two copies
of graph $G$, and replace all pairs $xT_x$, $x'T_{x'}$ by a single edge $xx'$ of unit weight. We can then remove all the telomere
vertices. The resulting graph will have only unit weight edges and maximum matching exactly twice the size of
maximum matching in the original graph.

4.3. Halving problems in the general model 

The same tricks can be used for the halving and the guided halving problem. Recall that in the halving problem,
given a duplicated genome $\gamma$, we are searching for $\alpha$ that minimizes the double distance $dd(\alpha, \gamma)$ and in the guided
halving problem, we are in addition given genome $\rho$ and we are minimizing the sum $dd(\alpha, \gamma) + d(\alpha, \rho)$.

Again, we construct graph $G$, where this time, the weight of edge $xy$ is the number of adjacencies among $x^1y^1$, $x^1y^2$,
$x^2y^1$, $x^2y^2$ in $\gamma$ and possibly $xy$ in $\rho$ (in case of the guided halving problem). The rest of the solution is identical,
giving an $O(n\sqrt{n})$ algorithm for the guided halving problem. In the halving problem, the degrees of vertices in $G$ are
at most 2 and after including all the double edges in the solution, the remaining graph consists only of cycles and the
maximum matching can be found trivially in $O(n)$ time.

5. Breakpoint phylogeny 

In the Small-Phylogeny problem, we try to reconstruct the ancestral genomes given a phylogenetic tree and gene
orders of the extant species while minimizing the sum of distances along the edges of the tree. This problem is NP-hard
for most rearrangement distances and for most models, this follows trivially from the NP-hardness of the Median
problem. However, as we have seen in the previous section, this is not the case in the general breakpoint model and
the complexity of the Small-Phylogeny problem remained open [5, 13].

In this section, we prove that the Small-Phylogeny problem is NP-hard also in the general breakpoint model. We 
show that the problem is NP-hard already for 4 species, a special case that we call the Breakpoint-Quartet problem. 

Given four genomes $\pi_1, \pi_2, \pi_3, \pi_4$, the Breakpoint-Quartet problem is to find ancestral genomes $\alpha_1, \alpha_2$ that maximize
the sum of similarities along the edges of the quartet tree in Figure 4, i.e., the sum

    $S(\alpha_1, \alpha_2) = \mathrm{sim}(\pi_1, \alpha_1) + \mathrm{sim}(\pi_2, \alpha_1) + \mathrm{sim}(\alpha_1, \alpha_2) + \mathrm{sim}(\alpha_2, \pi_3) + \mathrm{sim}(\alpha_2, \pi_4)$.
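The quartet score is straightforward to evaluate for given candidate ancestors. The following sketch assumes circular genomes stored as sets of `frozenset` adjacencies; the function names and the toy genomes are hypothetical and serve only to show how the five terms are combined.

```python
def sim(genome1, genome2):
    """Number of common adjacencies of two perfect matchings."""
    return len(genome1 & genome2)


def quartet_score(pi1, pi2, pi3, pi4, alpha1, alpha2):
    """Sum of similarities along the five edges of the quartet tree."""
    return (sim(pi1, alpha1) + sim(pi2, alpha1) + sim(alpha1, alpha2)
            + sim(alpha2, pi3) + sim(alpha2, pi4))


fs = frozenset
A = {fs({(1, 'h'), (1, 't')}), fs({(2, 'h'), (2, 't')})}
B = {fs({(1, 'h'), (2, 't')}), fs({(2, 'h'), (1, 't')})}

# Leaves A, A on one side and B, B on the other side of the quartet tree.
print(quartet_score(A, A, B, B, A, B))  # prints 8
print(quartet_score(A, A, B, B, A, A))  # prints 6
```

In this toy instance, choosing each ancestor equal to its two leaves scores 8, while forcing both ancestors to agree gains 2 on the middle edge but loses 4 on the edges to the leaves.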




Figure 4: Quartet tree.

Theorem 5. The Breakpoint-Quartet problem is NP-hard and even APX-hard in the general breakpoint model. 

The proof is inspired by the work of Dees [22] who showed that the following problem is NP-hard: Given two
graphs $G_1 = (V, E_1)$, $G_2 = (V, E_2)$, find two perfect matchings $M_1 \subseteq E_1$ and $M_2 \subseteq E_2$ with the maximum overlap
$M_1 \cap M_2$. The problem is NP-hard even when the components in $G_1$ and $G_2$ are just cycles. In our proof, $\pi_1 \cup \pi_2$ will
correspond to $E_1$, $\pi_3 \cup \pi_4$ will correspond to $E_2$, and the unknown ancestors $\alpha_1, \alpha_2$ will correspond to the unknown
perfect matchings $M_1, M_2$.

Our proof is however much more involved and there are two reasons for this: First, the problem formulation does
not guarantee that $\alpha_1 \subseteq \pi_1 \cup \pi_2$ and $\alpha_2 \subseteq \pi_3 \cup \pi_4$. We will say that a solution $\alpha_1, \alpha_2$ that satisfies this condition is in a
normal form. The hard part of the proof is actually showing that we can transform any solution $\alpha_1, \alpha_2$ into an at least as
good solution $\alpha_1', \alpha_2'$ that is in the normal form.

The second major difficulty is that we are maximizing the sum $S(\alpha_1, \alpha_2)$ instead of just the size of the intersection.
So a solution with maximum score $S(\alpha_1, \alpha_2)$ does not necessarily maximize the term $\mathrm{sim}(\alpha_1, \alpha_2)$, the size of the
intersection. To overcome these difficulties, we had to modify the edge gadget from the original proof and use a more
restricted problem for the reduction.



5.1. Overview of the proof

The proof is by reduction from the Cubic-Max-Cut problem. Given a graph $G$, the Max-Cut problem is to find a
cut of maximum size. We may phrase this as a problem of coloring all vertices in $G$ red or green while maximizing
the number of red-green edges. (Partition of $V$ into the red part and the green part defines a cut and its size is the
number of edges with endpoints of different color.) In the Cubic-Max-Cut problem, the instances are cubic graphs;
this variant is still NP-hard and APX-hard [23].
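The red/green phrasing of Max-Cut can be made concrete with an exhaustive search over all colorings. This is only a sketch for tiny instances (it tries all $2^{|V|}$ colorings, in line with the hardness of the problem); the function name and the example graph are assumptions for illustration.

```python
from itertools import product


def max_cut(vertices, edges):
    """Try every red/green coloring and return (size, coloring) of a best cut."""
    best_size, best_coloring = -1, None
    for colors in product(('red', 'green'), repeat=len(vertices)):
        coloring = dict(zip(vertices, colors))
        # The cut size is the number of edges whose endpoints differ in color.
        size = sum(coloring[u] != coloring[v] for u, v in edges)
        if size > best_size:
            best_size, best_coloring = size, coloring
    return best_size, best_coloring


# K4 is the smallest cubic graph.
k4_vertices = [0, 1, 2, 3]
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
size, coloring = max_cut(k4_vertices, k4_edges)
print(size)  # prints 4
```

For $K_4$, coloring two vertices red and two green cuts 4 of the 6 edges, while a 1-3 split cuts only 3.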

Let $G = (V, E)$ be a given cubic graph, an instance of the Cubic-Max-Cut problem. We will construct genomes $\pi_1$,
$\pi_2$, $\pi_3$, and $\pi_4$ such that the maximum cut in $G$ can be recovered from the solution $\alpha_1, \alpha_2$ of the Breakpoint-Quartet
problem in polynomial time.

For each vertex of $G$, there will be a vertex gadget (see Figure 5(a)) made of adjacencies of $\pi_1$ and $\pi_2$. Let $\pi_1$ be
the red matching and $\pi_2$ the green matching. As we will prove later, we may suppose that $\alpha_1 \subseteq \pi_1 \cup \pi_2$, so within
each vertex gadget, $\alpha_1$ will contain either the red edges of $\pi_1$ or the green edges of $\pi_2$. This naturally corresponds to
a red/green vertex coloring in the Cubic-Max-Cut problem.



The framed vertices in Figure 5(a) are called "ports": this is where the three incident edges are attached. For
each edge of $G$, an edge gadget is constructed as shown in Figure 5(b). The blue cycles consist of two matchings, the
adjacencies of $\pi_3$ and $\pi_4$. Again, as we will prove later, we may suppose that $\alpha_2 \subseteq \pi_3 \cup \pi_4$, i.e., the second ancestor
consists only of the blue edges.



[Figure 5 panels: (a) Vertex gadget. (b) Edge gadget. Labels in the figure: port, intermediate vertex, auxiliary adjacencies.]

Figure 5: The vertex and edge gadgets used in our reduction and the terminology used for different types of vertices and edges. The red and green
edges are the adjacencies of $\pi_1$ and $\pi_2$, respectively. The cycles made of blue edges can be decomposed into two matchings, the adjacencies of $\pi_3$
and $\pi_4$.



For future reference, let us state here again the claims to be proved in the form of a lemma: 

Normal form lemma. Let $\pi_1, \pi_2, \pi_3, \pi_4$ be an instance of the Breakpoint-Quartet problem constructed from a
Cubic-Max-Cut instance as described above. Then any solution $\alpha_1, \alpha_2$ can be transformed in polynomial time into a
solution $\alpha'_1, \alpha'_2$ such that $S(\alpha'_1, \alpha'_2) \geq S(\alpha_1, \alpha_2)$ and

$\alpha'_1 \subseteq \pi_1 \cup \pi_2$ and $\alpha'_2 \subseteq \pi_3 \cup \pi_4$.

Once we prove the Normal form lemma, the rest of the proof is easy: If $\alpha_1, \alpha_2$ is any solution in normal form, the
term $\mathrm{sim}(\pi_1, \alpha_1) + \mathrm{sim}(\pi_2, \alpha_1)$ is always the same: we get +6 for each vertex gadget and +6 for each edge gadget.
Similarly, the term $\mathrm{sim}(\alpha_2, \pi_3) + \mathrm{sim}(\alpha_2, \pi_4)$ is always the same: we get +9 for each edge gadget. So the score $S(\alpha_1, \alpha_2)$
is maximized when $\mathrm{sim}(\alpha_1, \alpha_2) = |\alpha_1 \cap \alpha_2|$ is maximized. Let $uv$ be an edge in our graph $G$ from the Cubic-Max-Cut
problem; if we choose matchings of the same color for both vertex gadgets $u$ and $v$, then $\alpha_1$ and $\alpha_2$ can only have
one edge in common within the edge gadget $uv$ (see Figure 6(a)). However, if $u$ and $v$ have matchings of different
colors, then we can set the adjacencies of $\alpha_2$ so that $\alpha_1$ and $\alpha_2$ have 2 edges in common (see Figure 6(b)). A cubic graph
with $n$ vertices has $3n/2$ edges, so when we sum up all the contributions, we get
$S(\alpha_1, \alpha_2) = 6n + 15 \cdot 3n/2 + 3n/2 + c = 30n + c$, where $n$ is the number of vertices in $G$ and $c$ is the size of the
cut corresponding to the matching $\alpha_1$, so a polynomial algorithm for Breakpoint-Quartet would imply a polynomial
algorithm for Cubic-Max-Cut.
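
The bookkeeping behind the value $30n + c$ can be checked with a short Python sketch. It only re-adds the per-gadget contributions quoted above (+6 per vertex gadget, +6 and +9 per edge gadget, one common adjacency per edge gadget plus one more per cut edge); the function name is a hypothetical helper, not something defined in the paper.

```python
def normal_form_score(n, cut):
    """Score of a normal-form solution for a cubic graph with n vertices.

    `cut` is the number of edges whose endpoints have different colors.
    """
    if n % 2 != 0:
        raise ValueError("a cubic graph has an even number of vertices")
    m = 3 * n // 2                   # number of edges of a cubic graph
    leaf_terms_a1 = 6 * n + 6 * m    # sim(pi1, a1) + sim(pi2, a1)
    leaf_terms_a2 = 9 * m            # sim(a2, pi3) + sim(a2, pi4)
    common = m + cut                 # sim(a1, a2): 1 per gadget, +1 per cut edge
    return leaf_terms_a1 + leaf_terms_a2 + common
```

For every even `n` and every `cut`, the returned value equals `30 * n + cut`, so maximizing the score is the same as maximizing the cut.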




(a) Adjacencies of the first ancestor $\alpha_1$ (red edges) agree with the adjacencies of $\pi_1$ at both vertex gadgets. This
corresponds to coloring both vertices red in the Cubic-Max-Cut problem. Note that $\alpha_1$ and $\alpha_2$ can only have one
adjacency in common.

(b) In the first vertex gadget, $\alpha_1$ agrees with $\pi_1$ (red edges) and in the second gadget, $\alpha_1$ agrees with $\pi_2$ (green edges).
This corresponds to coloring the first vertex red and the second green in the Cubic-Max-Cut problem. In this case, $\alpha_1$
and $\alpha_2$ have two adjacencies in common.

Figure 6: The dashed edges indicate the underlying vertex and edge gadgets, the blue edges are adjacencies of $\alpha_2$ and the red, green, and yellow
edges are adjacencies of $\alpha_1$. Here, we assume that $\alpha_1$ and $\alpha_2$ are in normal form.



For the APX-hardness, note that for any graph with $n$ vertices, we can easily find a cut of size $c \geq n/2$. Let
$\alpha^*_1, \alpha^*_2$ be an optimal solution for an instance of the Breakpoint-Quartet problem and $\alpha_1, \alpha_2$ a solution such that
$S(\alpha^*_1, \alpha^*_2) \leq (1 + \varepsilon) S(\alpha_1, \alpha_2)$. Let both solutions be in normal form and let $c^*$ and $c \geq n/2$ be the sizes of the
corresponding cuts. Then $30n + c^* \leq (1 + \varepsilon)(30n + c)$ and, using $n \leq 2c$, $c^* \leq (1 + \varepsilon)c + 30\varepsilon n \leq (1 + 61\varepsilon)c$. So a $(1 + \varepsilon)$-approximation
algorithm for the Breakpoint-Quartet problem gives a $(1 + 61\varepsilon)$-approximation algorithm for the Cubic-Max-Cut
problem.
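
The inequality chain in this argument can be sanity-checked numerically. The sketch below is only an illustration under the assumptions stated in the text (score $30n + c$ in normal form, $c \geq n/2$); the function names are hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction

def c_star_upper_bound(n, c, eps):
    """Largest c* compatible with 30n + c* <= (1 + eps)(30n + c)."""
    return (1 + eps) * (30 * n + c) - 30 * n   # equals (1 + eps)c + 30*eps*n

def transfer_holds(n, c, eps):
    """Check that the bound above is at most (1 + 61*eps) * c."""
    if 2 * c < n:
        raise ValueError("the argument assumes c >= n/2")
    return c_star_upper_bound(n, c, eps) <= (1 + 61 * eps) * c
```

The constant 61 comes from $30\varepsilon n \leq 60\varepsilon c$, which is exactly where the assumption $c \geq n/2$ is used.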



5.2. Notation, terminology, and other conventions 

Let $\Pi = \pi_1 \cup \pi_2 \cup \pi_3 \cup \pi_4$ be the set of adjacencies present in at least one extant species. We say that an adjacency
$e \in \alpha_i$ is supported if $e \in \Pi$; otherwise it is unsupported.
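
As a small illustration of this definition, the following Python sketch splits the adjacencies of an ancestor into supported and unsupported ones. Representing an adjacency as a two-element `frozenset` of extremities and a genome as a set of adjacencies is an assumption made here for illustration; the helper names are hypothetical.

```python
def adjacency(u, v):
    """An adjacency is an unordered pair of extremities."""
    return frozenset((u, v))

def support_set(extant_genomes):
    """Pi: every adjacency present in at least one extant genome."""
    return set().union(*extant_genomes)

def split_by_support(ancestor, extant_genomes):
    """Return (supported, unsupported) adjacencies of an ancestor."""
    big_pi = support_set(extant_genomes)
    supported = {e for e in ancestor if e in big_pi}
    return supported, set(ancestor) - supported
```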

Let us name the different types of vertices (extremities) and edges (adjacencies) in the following manner: The
framed vertices in Fig. 5(a) are called ports and edges from $\pi_1 \cup \pi_2$ that connect them are called port edges. We use
the same names also for other (extant or ancestral) adjacencies which are parallel to these.

Each port consists of two outer extremities called corners and the middle extremity in between. The sets of all
ports, corners, and middle vertices are denoted by $P$, $C$, and $M$, respectively ($P = C \cup M$). The set of intermediate
extremities between two ports of a vertex gadget is denoted by $I$.

The double edges and the two vertices at the top of Fig. 5(b) are auxiliary: they just complete the matchings into
perfect matchings.

As the subgraph of an edge gadget without the auxiliary and port edges resembles a ladder, we use the following
"ladder" terminology (see Fig. 5(b)): The red-green double adjacencies are the rungs and the blue adjacencies
are the rails of the ladder. Again, we use the same names for parallel adjacencies. The set of auxiliary extremities is
denoted by $A$ and the set of ladder extremities is denoted by $L$.

We say that $uv$ is an X-Y-edge if $u \in X$ and $v \in Y$ ($X$ and $Y$ do not have to be disjoint); an X-edge is any edge $uv$
such that $u \in X$ or $v \in X$.

In the proof of the Normal form lemma, we will gradually transform a given solution $\alpha_1, \alpha_2$ by exchanging some
adjacencies in the solution for other adjacencies. The method is analogous to improving a given matching by an
augmenting path: An $\alpha_i$-alternating cycle is a cycle where edges belonging to $\alpha_i$ and edges not belonging to $\alpha_i$
alternate. We will say that $C_1, C_2$ is a non-negative pair of cycles for the solution $\alpha_1, \alpha_2$ if $C_i$ is an $\alpha_i$-alternating
cycle and exchanging the matched and the unmatched edges of $C_i$ in $\alpha_i$ (for $i = 1, 2$) does not decrease the score:

$S(\alpha_1 \oplus C_1, \alpha_2 \oplus C_2) \geq S(\alpha_1, \alpha_2)$.
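
The exchange along an alternating cycle is a symmetric difference of edge sets, which can be sketched in Python as follows. This is a minimal illustration under stated assumptions: genomes are perfect matchings stored as sets of two-element `frozenset`s, $\mathrm{sim}$ is the number of shared adjacencies, and the score is the sum of the five $\mathrm{sim}$ terms that appear in the overview; all function names are hypothetical.

```python
def edge(u, v):
    return frozenset((u, v))

def closed_walk_edges(cycle):
    """Edges of a cycle given as a list of vertices, e.g. [x, p, z, y]."""
    k = len(cycle)
    return [edge(cycle[i], cycle[(i + 1) % k]) for i in range(k)]

def is_alternating(matching, cycle):
    """True if matched and unmatched edges alternate around the cycle."""
    es = closed_walk_edges(cycle)
    k = len(es)
    return k % 2 == 0 and all(
        (es[i] in matching) != (es[(i + 1) % k] in matching) for i in range(k))

def exchange(matching, cycle):
    """Swap matched and unmatched edges of the cycle (symmetric difference)."""
    return matching ^ set(closed_walk_edges(cycle))

def sim(a, b):
    return len(a & b)

def score(extant, a1, a2):
    p1, p2, p3, p4 = extant
    return sim(p1, a1) + sim(p2, a1) + sim(a1, a2) + sim(a2, p3) + sim(a2, p4)

def is_non_negative_pair(extant, a1, a2, c1, c2):
    """Check that exchanging along c1 in a1 and c2 in a2 does not lower the score."""
    if not (is_alternating(a1, c1) and is_alternating(a2, c2)):
        return False
    return score(extant, exchange(a1, c1), exchange(a2, c2)) >= score(extant, a1, a2)
```

Because every vertex of an alternating cycle loses one matched edge and gains one, the result of `exchange` is again a perfect matching.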

In the figures that follow, we will draw adjacencies of $\alpha_2$ blue and adjacencies of $\alpha_1$ red, green, or yellow: We use
red and green for edges in the vertex gadgets that are in common with $\pi_1$ or $\pi_2$, respectively (since this corresponds to
choosing the red or green color in the Cubic-Max-Cut problem). We use yellow for the other edges. We use straight
lines for the actual adjacencies and wavy lines for the suggested adjacencies in non-negative cycles that should be
included instead.

In the proof, we will often say

we may suppose that the solution has property $\mathcal{P}$

as a shorthand for a more precise (and longer) statement

Given any solution $\alpha_1, \alpha_2$, we can transform it in polynomial time into a solution $\alpha'_1, \alpha'_2$ with
$S(\alpha'_1, \alpha'_2) \geq S(\alpha_1, \alpha_2)$ having property $\mathcal{P}$; in particular, if $\alpha_1, \alpha_2$ is an optimal solution, $\alpha'_1, \alpha'_2$ is also
optimal, with property $\mathcal{P}$. From now on, we will assume that the solution has property $\mathcal{P}$.

5.3. Proof of the Normal form lemma 

First, we focus on the adjacencies that the ancestors $\alpha_1$ and $\alpha_2$ have in common. We will show that these may be
assumed to be supported. This is an important first step, since then we can argue that all the unsupported adjacencies
contribute zero to the score.

Proposition 1. We may suppose that all red-green double edges are matched in $\alpha_1$ and all blue double edges are
matched in $\alpha_2$, i.e., $\pi_1 \cap \pi_2 \subseteq \alpha_1$ and $\pi_3 \cap \pi_4 \subseteq \alpha_2$.

Proof. We can alternately replace genome $\alpha_1$ or $\alpha_2$ by the median of its neighbors until we converge to a local
optimum. As we have already proved in the previous section, we may assume that a median contains all adjacencies
occurring at least twice. □
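
The alternating replacement used in this proof can be sketched in Python. This is a toy illustration, not the algorithm from the previous section: it assumes genomes are perfect matchings stored as sets of two-element `frozenset`s, it finds each median by brute force over all perfect matchings (so it only works for a handful of extremities), and it does not enforce the tie-breaking that keeps adjacencies occurring at least twice. All names are hypothetical.

```python
def perfect_matchings(vertices):
    """Enumerate all perfect matchings on a list of vertices."""
    if not vertices:
        yield frozenset()
        return
    first, rest = vertices[0], vertices[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for m in perfect_matchings(remaining):
            yield m | {frozenset((first, partner))}

def sim(a, b):
    return len(a & b)

def score(extant, a1, a2):
    p1, p2, p3, p4 = extant
    return sim(p1, a1) + sim(p2, a1) + sim(a1, a2) + sim(a2, p3) + sim(a2, p4)

def median(vertices, a, b, c):
    """Brute-force median: a matching maximizing the summed similarity."""
    return max(perfect_matchings(vertices),
               key=lambda m: sim(m, a) + sim(m, b) + sim(m, c))

def steinerize(vertices, extant, a1, a2):
    """Alternately replace each ancestor by a median of its three neighbors."""
    p1, p2, p3, p4 = extant
    current = score(extant, a1, a2)
    while True:
        a1 = median(vertices, p1, p2, a2)
        a2 = median(vertices, p3, p4, a1)
        updated = score(extant, a1, a2)
        if updated == current:      # no improvement: local optimum reached
            return a1, a2
        current = updated
```

Each replacement maximizes the three score terms that involve the replaced ancestor while the other two terms stay fixed, so the score never decreases and the loop terminates.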

This immediately excludes common unsupported A-, I-, and L-edges.

Proposition 2. We may suppose that $\alpha_1$ and $\alpha_2$ do not contain unsupported M-edges. In other words, we may suppose
that in both $\alpha_1$ and $\alpha_2$, one of the edges in each port is chosen.

Proof. Let $x \in M$. First, consider the case that $xy_1 \in \alpha_1$ and $xy_2 \in \alpha_2$ are both unsupported. Let $p$ be a neighbouring
corner vertex; while $xy_1$ and $xy_2$ contribute at most +1 to the score (if $y_1 = y_2$), a common adjacency $xp$ would
contribute +3; let $pz_1$ and $pz_2$ be the actual adjacencies in $\alpha_1$ and $\alpha_2$; either $z_1 = z_2$ and the common adjacency is
unsupported, or $z_1 \neq z_2$; either way, these two edges contribute at most +2 to the score; so $x p z_1 y_1 x$ and $x p z_2 y_2 x$ is a
non-negative pair of cycles.

Similarly, if one ancestor contains a port edge $xp$ and the other one an unsupported adjacency $xy$ and adjacency
$pz$, then $x p z y x$ is a non-negative cycle. □

Proposition 3. We may suppose that all L-edges are supported. 

Proof. In $\alpha_1$, both L-edges are the rung edges by Proposition 1. Consequently, the contribution of any unsupported
L-edge in $\alpha_2$ to the score is zero. Let $\ell_1 x \in \alpha_2$ be such an edge. Let $\ell_1 \ell_2$ be the middle rail edge and let $\ell_2 y$ be the
adjacency in $\alpha_2$. If $\ell_2 y$ is unsupported, $\ell_1 \ell_2 y x \ell_1$ is an augmenting cycle. Otherwise, if $\ell_2 y$ is a rail edge, $\ell_1 \ell_2 y x \ell_1$ is
a non-negative cycle. The last case is that $\ell_2 y$ is a rung edge. Let $\ell_3 = y$, let $\ell_3 \ell_4$ be the other middle rail edge, and
let $\ell_4 z$ be the adjacency in $\alpha_2$. Again, if $\ell_4 z$ is unsupported, $\ell_1 \ell_2 \ell_3 \ell_4 z x \ell_1$ is an augmenting cycle; otherwise it is a rail
edge and the cycle is non-negative.

□ 

It remains to rule out unsupported C-C-edges.



Proposition 4. We may suppose that there are no common C-edges other than port edges. 



Proof. Let $xb$ be a common C-C-edge in $\alpha_1$ and $\alpha_2$. In the proof, we will refer to and use the notation of Fig. 7. From
what we have proved so far, we may assume that $\alpha_1$ contains the rung edges and {c(d, $am_1$ is a common adjacency
of $\alpha_1$ and $\alpha_2$, and either $m_2c$ or $m_2d$ is included in $\alpha_2$.

First, assume the latter case, that $m_2d \in \alpha_2$ (Fig. 7(a) and 7(b)). Since the L-edges are supported, either tttc $\in \alpha_2$
(Fig. 7(a)) or both daf^b and belong to $\alpha_2$ (Fig. 7(b)). In either case, we can add ladder edges to form an alternating
$b$-$c$ path with score +1 that will be a part of our non-negative pair of cycles. Let $cz$ be the unsupported adjacency in
$\alpha_2$. Either $cz \notin \alpha_1$ and $xb \ldots czx$ is a non-negative cycle (see Fig. 7(a)), or $cz$ is a common edge and we also have to
exchange some edges in $\alpha_1$. In particular, $xbczx$ and $xb \ldots czx$ is a non-negative pair of cycles (see Fig. 7(b)).



Similarly, we can prove the other case when $m_2c \in \alpha_2$; the non-negative cycle pairs are depicted in Fig. 7(c) and
7(d). It can be easily checked that the proof also works when extremities $x$ and $b$ belong to the same edge gadget
(in this case $x$ is $c$ or $d$ and $b$ coincides with $z$). A C-C-edge connecting two corners of a single port is ruled out by
Proposition 2.

Note that if $\alpha_1$ and $\alpha_2$ have a common $z$-edge (Fig. 7(b) and 7(d)), we create a new common unsupported C-C-edge
$xz$. However, the number of common unsupported C-C-edges is decreased by 1 in all cases. □




(a) Case 1: $\alpha_2$ contains $m_2d$ and $cz \notin \alpha_1$. (b) Case 2: $\alpha_2$ contains $m_2d$ and $cz \in \alpha_1 \cap \alpha_2$.

(c) Case 3: $\alpha_2$ contains $m_2c$ and $dz \notin \alpha_1$. (d) Case 4: $\alpha_2$ contains $m_2c$ and $dz \in \alpha_1 \cap \alpha_2$.

Figure 7: Different cases that arise when disposing of unsupported common C-C-edges. The dashed edges represent the underlying edge gadgets;
adjacencies of $\alpha_2$ are blue, adjacencies of $\alpha_1$ are yellow, red, and green. Wavy lines are the new suggested adjacencies that should be exchanged
for the present ones in the non-negative cycles.



Corollary 1. We may suppose that all the common adjacencies of the ancestors $\alpha_1$ and $\alpha_2$ are supported: $\alpha_1 \cap \alpha_2 \subseteq \Pi$.
More specifically, we may suppose that the only common edges are port edges and rung edges. Consequently, each
unsupported adjacency contributes zero to the score.

We say that $\alpha_1$ is uniform at a vertex gadget if all the port edges in the gadget have the same color (they all agree
with either the $\pi_1$ edges or the $\pi_2$ edges). Next, we prove that $\alpha_1$ may be assumed uniform at all gadgets. Such an
ancestor $\alpha_1$ directly corresponds to a cut in $G$.

Here, we use the fact that $G$ is cubic: Imagine that $G$ were a complete bipartite graph $K_{n,n}$ with one more vertex
connected to all the other vertices. Then our reduction would not work, since the optimal ancestors would color one
part of the bipartition red, the other green, and the extra vertex half green and half red (i.e., half of the ports would be
green and the other half red).

First, let us characterize what the non-uniform gadgets look like.

Proposition 5. The following are equivalent:

• $\alpha_1$ is not uniform at a vertex gadget

• there is one unsupported I-edge in $\alpha_1$ incident to the vertex gadget

• there is one unsupported C-edge in $\alpha_1$ incident to the vertex gadget

Proof. Let $\alpha_1$ be non-uniform at a vertex gadget. Without loss of generality, let two of the port edges be green and
one be red (see Fig. 8(a)). Denote by $r$ the red edge and by $g_1$ and $g_2$ the green edges, such that $g_1$ is closer to $r$ (as in Fig. 8(a)).
The edge incident to the intermediate extremity between $r$ and $g_1$ is an unsupported I-edge.

Obviously, if two neighbouring extremities in a vertex gadget are incident with unsupported edges, there is an
augmenting cycle, so we may suppose that the intermediate edge between $g_1$ and $g_2$ is green and one of the intermediate
edges $e$ or $f$ in Fig. 8(a) belongs to $\alpha_1$; the other corner has an unsupported C-edge.

Conversely, if there is an unsupported I-edge or C-edge, the neighbouring ports cannot have edges of the same
color (this would imply two neighbouring extremities with unsupported edges in $\alpha_1$). □



(a) A non-uniform ancestor $\alpha_1$ at a vertex gadget.

(b) The non-uniform gadgets are connected by unsupported I-edges and C-edges.

(c) A non-negative cycle used for disposing of non-uniform gadgets and unsupported edges in $\alpha_1$.

Figure 8: Non-uniform ancestors at a vertex and a way to remedy them.

We are ready to prove the Normal form lemma. 

Proposition 6. We may suppose that in each vertex gadget, the port edges of $\alpha_1$ are either all red or all green. In
particular, we may suppose that $\alpha_1 \subseteq \pi_1 \cup \pi_2$.

Proof. We prove that for each vertex gadget, we may simply look at the three port edges and choose the color by
majority vote. In the previous proposition, we have proved that non-uniform gadgets have exactly two unsupported
edges, so they form cycles as in Fig. 8(b). Fig. 8(c) shows the non-negative cycle that we get by including the
majority-vote color edges. In each vertex gadget, we may lose 1 point for switching the port edge (if this was a common edge),
but we get 1 extra point for increasing the number of supported edges. □
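
The majority-vote readout can be illustrated with a short Python sketch. It only shows how a vertex coloring, and hence a cut, would be read off from the colors of the three port edges of each gadget; it does not perform the exchange along non-negative cycles described in the proof. The data layout (a dict from vertices to three port colors) and the function names are assumptions.

```python
from collections import Counter

def majority_color(port_colors):
    """Color chosen by majority among the three port edges of a gadget."""
    if len(port_colors) != 3:
        raise ValueError("a vertex gadget of a cubic graph has three ports")
    return Counter(port_colors).most_common(1)[0][0]

def cut_from_gadgets(port_colors_by_vertex, edges):
    """Return the induced vertex coloring and the size of its cut."""
    color = {v: majority_color(ports)
             for v, ports in port_colors_by_vertex.items()}
    return color, sum(1 for u, v in edges if color[u] != color[v])
```

With three ports and two colors a strict majority always exists, so the vote is never tied.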



Proposition 7. We may suppose that $\alpha_2 \subseteq \pi_3 \cup \pi_4$.

Proof. The only remaining edges in $\alpha_2$ that may be outside $\pi_3 \cup \pi_4$ are rung edges and unsupported C-C-edges. If $\alpha_2$
contains a rung edge, then in the adjacent port, one corner is covered by a port edge, but the other corner is not and
we have an unsupported C-C-edge. Conversely, it is easy to see that if $\alpha_2$ contains a C-C-edge then in the attached
edge gadget, either both rung or both middle rail edges are in $\alpha_2$ and there is a C-C-edge incident to a corner of the
opposite port. So the edge gadgets with C-C-edges form cycles and all the rung edges are in these edge gadgets (see
Fig. 9). We will fix both at the same time.

We can join the two corners in an edge gadget by a non-negative alternating path (see Fig. 9); we can lose 1 point
for destroying a common adjacency of $\alpha_1$ and $\alpha_2$, but we gain 1 point for increasing the number of supported edges
in $\alpha_2$. □



Figure 9: Example of three edge gadgets connected in a cycle by unsupported C-C-edges. We can join two corners with unsupported C-C-edges
in an edge gadget by a non-negative path. Note that we also get rid of the blue rung edges in the top and right edge gadgets at the same time.

This concludes the proof of the Normal form lemma and thus also the proof of NP-hardness and APX-hardness of 
the Breakpoint-Quartet problem. 



6. Conclusion 

In this paper, we have settled the open problems concerning the computational complexity of different rearrangement
problems in the breakpoint models. There are three intriguing questions in this area which remain open. The first
two are of theoretical interest and are related to approximability of the Small-Phylogeny problem, the third question 
is more practical: 

1. How well can we approximate Small-Phylogeny? For example, the Breakpoint-Quartet problem can be easily
formulated as an integer linear program (we can use different variables for the edges only in $\alpha_1$, only in $\alpha_2$, and
in the intersection $\alpha_1 \cap \alpha_2$). Its relaxation might lead to an algorithm with some approximation ratio.

2. In the Steinerization approach to ancestral reconstruction, we repeatedly replace the ancestral genomes by 
medians of genomes in the neighboring nodes of the tree until we converge to a local optimum. Despite the fact 
that this is the most common approach to ancestral reconstruction (also in the other models) and that preliminary 
experiments with simulated data suggest that this heuristic performs very well, no guarantees are known for the 
method (in any model). 

3. Finally, the motivation behind the general breakpoint model is that we can solve the median problem in
polynomial time. Using the Steinerization method, we can also get very good solutions of the Small-Phylogeny
problem rapidly. The question is how useful these solutions are in practice.
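
One possible way to write the integer linear program mentioned in the first question is sketched below. This is only an illustrative formulation under assumptions not spelled out in the text: both ancestors are perfect matchings on the set of extremities (as in the circular model used in the reduction), $e$ ranges over all pairs of extremities, and the weights $w_1(e) = [e \in \pi_1] + [e \in \pi_2]$ and $w_2(e) = [e \in \pi_3] + [e \in \pi_4]$ are introduced here for convenience. The variables $x_e$, $y_e$, $z_e$ indicate that $e$ lies only in $\alpha_1$, only in $\alpha_2$, or in $\alpha_1 \cap \alpha_2$, respectively.

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{e} \Bigl( w_1(e)\,(x_e + z_e) + w_2(e)\,(y_e + z_e) + z_e \Bigr) \\
\text{subject to}\quad
  & \sum_{e \ni v} (x_e + z_e) = 1 && \text{for every extremity } v, \\
  & \sum_{e \ni v} (y_e + z_e) = 1 && \text{for every extremity } v, \\
  & x_e + y_e \le 1               && \text{for every pair } e, \\
  & x_e,\, y_e,\, z_e \in \{0, 1\} && \text{for every pair } e.
\end{align*}
```

The first two constraint families say that $\alpha_1 = \{e : x_e + z_e = 1\}$ and $\alpha_2 = \{e : y_e + z_e = 1\}$ are perfect matchings, and the term $z_e$ in the objective counts $\mathrm{sim}(\alpha_1, \alpha_2)$. Replacing the last line by $0 \le x_e, y_e, z_e \le 1$ gives the relaxation referred to above.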